(57) ABSTRACT

A method, computer system and article of manufacture for optimizing a computer program, the method comprising the steps of executing an application program and profiling selected loops of the executing program. Characteristics of the profiled loops are then compared to corresponding predetermined threshold values and the results of the comparison are used to select an optimization to be applied to subsequent execution of the selected loops. In a preferred embodiment, the optimization is the selection of either a parallel version or a serial version of the loop. Further embodiments provide for the selection of the number of processors for parallel implemented loops and for the selection of an unroll factor in serially implemented loops.
SYSTEM AND METHOD FOR OPTIMIZING PROGRAM EXECUTION IN A COMPUTER SYSTEM

The present invention relates to a system and method for optimizing program execution and more particularly to profiling of computation loops for dynamic optimization over subsequent runs of the profiled loops.

BACKGROUND OF THE INVENTION

Computer systems having multi-processors have allowed computers to perform multiple tasks at the same time, thus reducing the time necessary to complete an operation. One type of multiprocessor system is a symmetric multiprocessor (SMP) system. Software designed to run on such systems is normally optimized to take advantage of the multiple processors. Typically, this involves coding parallel loops that can be executed using a number of threads that will execute the different iterations of the loop in parallel. This process is normally termed "parallelization". For SMP systems, there are several ways to allow a program to use the parallel processors. A programmer may use automatic parallelization included as a compiler utility, or parallelize the code by hand by using directives, such as those of a custom API, or use thread libraries such as POSIX threads. Automatic parallelization works at the level of loop nests, and can be very effective for programs that spend most of their time in nested loops.

Parallelization results in a certain amount of overhead. This overhead occurs as a result of setting up the parallel environment and subsequently synchronizing at the end of the parallel segment. The potential benefits of parallelization can be realized if this overhead is very small compared to the computation performed. Due to unknown loop bounds and the complexity of performance prediction, a compiler may have parallelized some of the loops that will not benefit from parallelization. In addition, the user may have parallelized loops that do not benefit from parallelization because of the inherent overheads involved.

Thus, profiling was introduced in an attempt to identify and remove such parallelization in some cases. Profiling is an integral part of understanding and tuning an application for improving program performance and may be used to monitor the resource usage during the execution of the application program. A program profile is a characterization of the execution of a program. A profile may typically include the execution time, paging requirements, and cache misses for each subprogram in the application. These are typically some of the resources that are monitored using program profiling. The resource usage information collected by the executing program is then used to fine-tune the performance of the subprograms. This fine-tuning can be done either by the programmer manually or by the compiler. The profiling option '-p' provided in UNIX® environments and the 'PDF' option available with the IBM® XL Fortran compilers are examples of programmer directed profiling. Using the profiling information, the compiler can generate code such that the paths that are executed most often are well optimized.

This form of profiling works in a multi-pass approach consisting of at least two passes. In the first pass, the profiling information is collected and that information is used to fine-tune the application for subsequent program executions. This type of approach is referred to as static profiling because the information gathered during the execution of the program is used after the program terminates. There are, however, limitations to the static approach. The program is generally run once before optimization. The requirements for optimal performance are used on all subsequent program executions. This can lead to a problem if some of the loops in the program are data dependent, that is, the choice between serial or parallel execution of the loops depends on the data set. In this case the programmer has to resort to dual or multi-path code based on the input data. The programmer thus chooses to execute the parallel version or the serial version of the application depending on the input data. This situation results in a familiar problem: the programmer may parallelize loops that should be run serially.

Serially run loops may also be optimized to run faster on uniprocessor systems. In this instance, loop unrolling may be applied at the [illegible] stage to offer [illegible] executing code. Loop unrolling typically repeats the code in the loop [illegible]. The number of times the code is repeated within the unrolled loop is termed the unroll factor. Once again, one of the deficiencies of current optimizers is that once an optimum unroll factor is determined, it is used on all subsequent program executions. The unroll factor for a loop is dependent not only on the bounds of the loop but also on the machine characteristics, which are difficult to model. At compile time, heuristics are used to select a loop unroll factor. Thus, optimization is still programmer dependent and does not reflect changes to the dataset during execution of the application.

It is an object of the present invention to obviate and mitigate some of these disadvantages.

SUMMARY OF THE INVENTION

The invention seeks to provide a solution to the problem of optimizing computation loops which are data dependent.

In accordance with this invention there is provided a method for optimizing a computer program, comprising the steps of executing an application program, profiling a loop of the executing program to determine a parameter for the loop, comparing the parameter to a threshold value for the loop and flagging the loop for applying an optimization on subsequent execution of the loop depending on said comparing step. Said method may also be provided wherein said optimization comprises a serialization of the loop. Further, said methods may be provided wherein said optimization comprises a parallelization of the loop. The above methods may also be provided wherein the loop includes a parallel and a serial version, and said serialization is a selection of the serial version for execution, and said parallelization is the selection of the parallel version for execution. The methods may also be provided wherein said step of profiling includes sampling the loop at a predetermined frequency. Said step of profiling may also include measuring an execution time of the loop. Further, the threshold value may include a sequential threshold value and a parallel threshold value, said sequential threshold value for determining an execution time above which a serially executing loop will be parallelized, and said parallel threshold value for determining an execution time below which a parallel executing loop will be serialized. The above methods may also be provided wherein said program executes on a computer system having a plurality of processors, and said optimization comprises the selection of a number of processors for execution of the loop and also wherein said threshold value is a preferred execution time for the number of processors selected. Also, the above method may further comprise the step of compiling the program with a plurality of unroll factors prior to execution and wherein said optimization comprises selection of one of said unroll factors for the loop.






US 6,341,371 B1

There is also provided a computer system for optimizing program execution, comprising means for executing an application program; means for profiling a loop of the executing program to determine a parameter for the loop; means for comparing the parameter to a threshold value for the loop; and means for applying an optimization on subsequent execution of the loop depending on a result of said comparing means. The above computer system may also be provided wherein said optimization comprises a serialization of the loop. Further, said optimization may also comprise a parallelization of the loop. The computer system may also be provided wherein the loop includes a parallel and a serial version, and said serialization is a selection of the serial version for execution, and said parallelization is the selection of the parallel version for execution. The computer system may also be provided wherein said means for profiling includes means for sampling the loop at a predetermined frequency. Said means for profiling may also include means for measuring an execution time of the loop. The threshold value may also include a sequential threshold value and a parallel threshold value, said sequential threshold value for determining an execution time above which a serially executing loop will be parallelized, and said parallel threshold value for determining an execution time below which a parallel executing loop will be serialized. Further, the computer system may include a plurality of processors, and said optimization comprises the selection of a number of processors for execution of the loop. The computer system may also be provided wherein said threshold value is a preferred execution time for the number of processors selected. And, the computer system may further comprise means for compiling the program with a plurality of unroll factors prior to execution and wherein said optimization comprises selection of one of said unroll factors for the loop.

There is also provided an article of manufacture comprising a computer usable medium having computer readable program code embodied therein for optimizing program execution in a computer system, the computer readable program code in said article of manufacture comprising computer readable program code configured to cause a computer system to execute an application program, computer readable program code configured to cause a computer system to profile a loop of the executing program to determine a parameter for the loop, computer readable program code configured to cause a computer system to compare the parameter to a threshold value for the loop and computer readable program code configured to cause a computer system to flag the loop for applying an optimization on subsequent execution of the loop depending on said comparing code. The above article of manufacture may also be provided wherein said optimization comprises a serialization of the loop. Further, the article of manufacture may be provided wherein said optimization comprises a parallelization of the loop. There may also be provided an article of manufacture wherein the loop includes a parallel and a serial version, and said serialization is a selection of the serial version for execution, and said parallelization is the selection of the parallel version for execution. Further, said computer readable program code configured to cause a computer system to profile may include computer readable program code configured to cause a computer system to sample the loop at a predetermined frequency. Further, said computer readable program code configured to cause a computer system to profile may include computer readable program code configured to cause a computer system to measure an execution time of the loop. Also, the threshold value may include a sequential threshold value and a parallel threshold value, said sequential threshold value for determining an execution time above which a serially executing loop will be parallelized, and said parallel threshold value for determining an execution time below which a parallel executing loop will be serialized. There may also be provided the above article of manufacture wherein said computer system includes a plurality of processors, and said optimization comprises the selection of a number of processors for execution of the loop. The article of manufacture may be provided wherein said threshold value is a preferred execution time for the number of processors selected. And, the article of manufacture may further comprise computer readable program code configured to cause a computer system to compile the program with a plurality of unroll factors prior to execution and wherein said optimization comprises selection of one of said unroll factors for the loop.

There is also provided an article of manufacture comprising a computer usable medium having computer readable program code embodied therein for optimizing program execution in a computer system, the computer readable program code in said article of manufacture comprising computer readable program code configured to cause a computer system to execute an application program; computer readable program code configured to cause a computer system to monitor a parameter of a loop of said executing program; computer readable program code configured to cause a computer system to compare the monitored parameter to a threshold value for the loop; and computer readable program code configured to cause a computer system to apply an optimization on subsequent execution of the loop depending on a result of said comparison code.

BRIEF DESCRIPTION OF THE DRAWINGS

An embodiment of the invention will now be described by way of example only, with reference to the accompanying drawings in which like numbers refer to like structures:

FIG. 1 is a schematic diagram of a software hierarchy in a general computer system;

FIG. 2 is a block diagram of a system for dynamically optimizing program execution according to a first embodiment of the present invention system;

FIG. 3 is a flow diagram of an optimization of the execution of a serially executing loop;

FIG. 4 is a flow diagram of an optimization of the execution of a parallel executing loop;

FIG. 5 is a graph showing loop execution time versus the number of processors used;

FIG. 6 is a block diagram of a system for dynamically optimizing program execution according to a second embodiment of the present invention system; and

FIG. 7 is a flow diagram of an optimization of the execution of a loop using an unroll factor.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Referring to FIG. 1, a block diagram of a typical software hierarchy in a computer system is shown generally by numeral 1. The software hierarchy includes an operating system 2, an application program 4 and a runtime 6 for running the application program. In the preferred embodiment, the runtime includes an optimizer 8 that interacts with the executing program 4 to identify the loops that should be optimized and marks the identified loops for execution according to a suitable optimization scheme, as discussed below.

Referring to FIG. 2, a computer system according to a first preferred embodiment of the invention for a parallel application is shown by numeral 10. In this embodiment, the system 10 includes a multiple processor unit computer 12 such as an SMP computer system, an application program 13 having at least one loop compiled by a compiler 17 into both serial and parallel versions and an optimizer 8. The optimizer comprises a profile analyzer 14 for receiving loop execution information at predetermined intervals from a monitoring system 15 and for comparing this monitored information to respective corresponding threshold values 16. The profile analyzer 14 uses the results of this comparison to flag selected loops to run either as a serial loop or as a parallel loop 18. It is assumed that the runtime is capable of executing, on the SMP system, the appropriate version of a loop based on the flag information associated with the loop.

The threshold values for the case of a parallel application include a serial threshold value (SEQTHRESHOLD), a parallel threshold value (PARTHRESHOLD) and a profile frequency (PROFILEFREQ).

The SEQTHRESHOLD value specifies the time beyond which a loop that is executing serially should be converted to run in parallel. If the time to execute the loop is greater than SEQTHRESHOLD, then the loop will probably benefit from parallel operation even with the accompanying overhead time. The default setting is 5 mSec.

The PARTHRESHOLD value specifies the time below which a loop that is executing in parallel should be serialized. If the loop executes faster than the PARTHRESHOLD, then the benefits of parallel execution may not be realized due to the overhead time associated with that type of execution. If this threshold is set to 0, then no loops will be changed to operate serially. The default setting is 0.2 mSec.

The PROFILEFREQ value indicates how often each loop must be sampled to determine whether to change the loop's mode of execution (from serial to parallel or vice versa). If this frequency value is set to zero, then all profiling is turned off. A frequency value of 1 instructs the profile analyzer to monitor each loop every time it is executed. Similarly, a frequency of 2 instructs the profile analyzer to monitor the loop on every other execution. The maximum PROFILEFREQ value is generally determined by the bit size of the integer used to keep track of the sampling frequency. It has been observed that the overhead introduced by the subject invention is negligible, thus a higher value for the profile frequency should reduce this overhead further.

Referring to FIG. 3, a flow diagram showing the profiling of a serially executing loop is indicated generally by the numeral 20. It is assumed that the selected loop has previously been determined to run sequentially and is marked as such, as indicated at block 22. The loop is then executed serially on the SMP system 24. The monitoring system 15 then provides the execution time of the loop to the profile analyzer 14. Although in this embodiment the execution time is used as the performance metric, other metrics such as network contention, memory contention, resource usage and such like may be used. Depending on the metric chosen by the profile analyzer 14, a corresponding threshold value must also be chosen.

At the next block, the profile analyzer 14 compares the measured execution time (26) to the SEQTHRESHOLD parameter. If the execution time is greater than SEQTHRESHOLD then the profile analyzer flags the loop 30 so that it may be run on the SMP system with the parallel version of the loop the next time it is executed 42. If the execution time is below SEQTHRESHOLD then the loop may continue executing serially 32.

Referring to FIG. 4, a flow diagram showing the profiling of a parallel executing loop is indicated generally by the numeral 40. This flow diagram is similar to that of FIG. 3, with the differences resulting from the fact that the loop to be profiled has previously been set up to execute in parallel 42 as opposed to serially 22. The general decision making process for a parallel loop is represented by the numeral 40. The loop is processed in parallel by the SMP system 24 and the execution time 26 is again monitored by the profile analyzer 14. The profile analyzer compares the measured execution time with the PARTHRESHOLD parameter. If the time is less than the PARTHRESHOLD then the profiler flags the loop 46 so that the SMP system will run the serial version of the loop the next time it is executed 22. If the execution time is greater than PARTHRESHOLD then the parallel version of the loop will be allowed to execute the next time it is run 48.

The profile analyzer will examine each loop every nth time it runs, where n is the value assigned to PROFILEFREQ. As previously mentioned, in this example n can range from 0 to 32. Since there are overheads associated with profiling, it may not be worthwhile to profile every time a loop is executed but rather to profile every tenth time a loop is executed.

This method of dynamic profiling can noticeably reduce the execution time of a given program. Loops that incorporate various data sets are, in effect, optimized according to each data set. The benefits of using the dynamic compiler can be seen from Table 1. SPECfp95 benchmarks were executed on an 8-processor IBM J-40 computer both with and without dynamic profiling. It may be noted that all measured time is in seconds. N/A means the execution time is greater than 5000 (seconds) and the runs were stopped. The default values as described above were used except for application 7 where PARTHRESHOLD was set to 0.5.



TABLE 1

                        Serial      Parallel
                        1           1           2           4           6           8
Application  Compiler   Processor   Processor   Processors  Processors  Processors  Processors

1            Compiler   1124.87     1152.88     604.92      345.39      280.29      247.63
             Dyn. Prof  1125.78     1134.56     605.36      336.66      289.12      246.38
2            Compiler   1715.28     1726.48     89X73       495.79      401.89      369.14
             Dyn. Prof  1699.31     1707.33     894.02      488.13      428.11      313.91
3            Compiler   728.28      727J8       506.24      464.74      558.40      697.22
             Dyn. Prof  711.05      728.67      490.65      430.37      536.55      668.26
4            Compiler   1370.18     1394.43     946.64      776.25      752.86      772.66
             Dyn. Prof  1372.97     1387.93     928.40      748.93      725.53      728.77
5            Compiler   861.23      867.41      509.64      337.91      354.11      374.57
             Dyn. Prof  854.64      865.10      532.84      328.05      316.19      297.59
6            Compiler   921.79      98X64       1684.87     2960.36     4691.79     6845.79
             Dyn. Prof  919.48      984.07      802.14      720.19      711.21      71X40
7            Compiler   916.71      2374.07     N/A         N/A         N/A         N/A
             Dyn. Prof  949.55      2338.05     2254.50     2235.54     2223.53     3499.74
8            Compiler   582.02      1426.54     N/A         N/A         N/A         N/A
             Dyn. Prof  523.53      1581.43     1496.82     1524.18     1505.26     1739.66
9            Compiler   1163.99     1160.45     1183.87     1140.54     1160.37     1993.33
             Dyn. Prof  1128.93     1109.92     1335.40     1119.20     1116.75     1144.52
10           Compiler   905.82      1089.54     4328.04     N/A         N/A         N/A
             Dyn. Prof  907.57      1265.30     1315.93     1333.05     1370.66     1569.90
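
The per-loop decision described with reference to FIGS. 3 and 4 may be easier to follow as code. The following is only a minimal sketch under stated assumptions: the `LoopState` record and the function names `should_profile` and `update_flag` are hypothetical and do not appear in the specification; only the SEQTHRESHOLD and PARTHRESHOLD defaults and the meaning of PROFILEFREQ are taken from the description above.

```python
from dataclasses import dataclass

SEQTHRESHOLD_MS = 5.0   # default given in the description
PARTHRESHOLD_MS = 0.2   # default given in the description
PROFILEFREQ = 10        # profile every tenth execution; 0 turns profiling off


@dataclass
class LoopState:
    """Hypothetical per-loop record kept by the runtime (an assumption)."""
    run_parallel: bool = False  # flag consulted before the next execution
    executions: int = 0         # number of times the loop has been executed


def should_profile(state: LoopState, freq: int = PROFILEFREQ) -> bool:
    """Count one execution and report whether it is to be sampled."""
    state.executions += 1
    return freq > 0 and state.executions % freq == 0


def update_flag(state: LoopState, elapsed_ms: float) -> None:
    """Flag which version of the loop runs on its next execution."""
    if state.run_parallel:
        # FIG. 4: a parallel loop faster than PARTHRESHOLD is serialized.
        if elapsed_ms < PARTHRESHOLD_MS:
            state.run_parallel = False
    else:
        # FIG. 3: a serial loop slower than SEQTHRESHOLD is parallelized.
        if elapsed_ms > SEQTHRESHOLD_MS:
            state.run_parallel = True
```

With PARTHRESHOLD set to 0 the comparison in the parallel branch can never succeed, which matches the statement that no loops are then changed to operate serially.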



For the parallel application described in the above embodiments, it is assumed that the selected loop is executing either in parallel with all the processors, or serially with only one processor. In a variation of the first embodiment of the parallel application described with respect to FIG. 2, the profile analyzer may also provide as an output the number of processors to be used by the parallelized loop. Since the threshold values for parallel and serial execution are previously defined as PARTHRESHOLD and SEQTHRESHOLD respectively, it is possible to interpolate between these values to determine the thresholds for an intermediate number of processors. A steady decrease in the number of processors depending on the loop characteristics may be more beneficial in improving the performance of the application. Depending on the loop characteristics some loops may perform better with a small number of processors, but may not perform as well sequentially or with a large number of processors.

Referring to FIG. 5, a graph showing the loop execution time versus the number of processors is shown generally by numeral 50. The PARTHRESHOLD and SEQTHRESHOLD values, chosen here to be 0.2 mSec (5 processors) and 5.0 mSec (1 processor) respectively, are indicated on the graph 50. A line joining these points is termed a "control line". The threshold values for two, three and four processors are determined by interpolation on the graph 50. If the execution time of a given loop is greater than the threshold (above the control line), the number of processors used will be increased by one on subsequent execution of the loop, up until the maximum number of processors is used. Conversely, if the execution time of a given loop is less than the threshold (below the control line), the number of processors used will be decreased by one on subsequent execution of the loop, down until only one is being used. For example, using the values indicated in FIG. 5 and assuming the loop is operating serially and takes 5.8 mSec to execute (point A), then next time the loop executes it will use two processors. Assume that the loop then takes 4.2 mSec to operate with two processors (point B). The profile analyzer will then flag the loop to run on three processors the next time it is executed. At this stage if the loop takes 2.2 mSec to operate with three processors (point C) then the profile analyzer will flag the loop to run on two processors the next time it is executed.
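
The control-line rule of FIG. 5 can be sketched as follows. This is an illustration only: it assumes a straight-line (linear) interpolation between the two end points, and the names `threshold_ms` and `next_procs` are hypothetical. The end-point values are those of the FIG. 5 example.

```python
SEQTHRESHOLD_MS = 5.0  # threshold for 1 processor (FIG. 5 example)
PARTHRESHOLD_MS = 0.2  # threshold for MAX_PROCS processors (FIG. 5 example)
MAX_PROCS = 5


def threshold_ms(procs: int) -> float:
    """Point on the control line for the given number of processors."""
    slope = (PARTHRESHOLD_MS - SEQTHRESHOLD_MS) / (MAX_PROCS - 1)
    return SEQTHRESHOLD_MS + slope * (procs - 1)


def next_procs(procs: int, elapsed_ms: float) -> int:
    """Number of processors to use on the next execution of the loop."""
    limit = threshold_ms(procs)
    if elapsed_ms > limit:
        # Above the control line: add one processor, up to the maximum.
        return min(procs + 1, MAX_PROCS)
    if elapsed_ms < limit:
        # Below the control line: remove one processor, down to one.
        return max(procs - 1, 1)
    return procs
```

Under this linear assumption the interpolated thresholds for two, three and four processors are about 3.8, 2.6 and 1.4 mSec. Points A, B and C of the example then give two, three and two processors respectively, in agreement with the text.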

The above embodiment is particularly useful in improving the performance of applications in multi-user environments, where many users are competing for processors. Thus by monitoring the system and user time on each processor, the profile analyzer may be used to optimize the performance of the computer.



Referring to FIG. 6, a computer system according to a second preferred embodiment of the invention for a serial application is shown by numeral 60. In this embodiment, the system 60 generally has a single processor for executing a serial application. The profile analyzer according to the present invention may be used to optimize the performance of the serial application. This embodiment optimizes the unroll factor for loops to exploit the single processor in the system. The serial application 62 is compiled by a compiler 64 to create multiple versions of each loop with different unroll factors 66. As before, the profile analyzer 67 receives the loop characteristics from a monitoring system 68. Then using the most efficient unroll factor, the profile analyzer 67 selects a version of the loop to execute for a particular data set 66. In a process similar to the previously described parallel application optimization, the profile analyzer examines the performance characteristics of a loop and determines which unroll factor to use the next time the loop is executed.

However, in order to determine which unroll factor is best 
for a particular loop it is necessary to execute that loop with 
all the unroll factors. For example, there may be three 
possible unroll factors generated by the compiler. The first 
time a loop is executed it will use the first unroll factor, the 
second time it will use the second unroll factor and the third 
time it will use the third unroll factor. The unroll factor that 
has the best results based on predetermined factors such as 
time, memory usage, or the like, will continue to be used 
until a change in the data set size occurs. At this point the 
process begins again and all three unroll factors are tested 
with the new data set. This is more clearly illustrated in the 
flow diagram of FIG. 7. Therefore, unlike parallel loops 
where the overhead for profiling, as described with respect 
to the first embodiment, is negligible, implementing such a 
profiling according to the second embodiment requires a 
careful cost/benefit analysis. 
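
The trial-and-select procedure for unroll factors may be sketched as below. This is a hedged illustration rather than the claimed implementation: the `UnrollSelector` class, its method names, and the use of execution time as the only metric are assumptions; the description also allows memory usage or similar factors.

```python
class UnrollSelector:
    """Hypothetical helper: time each compiled unroll factor once per data
    set size, then keep using the fastest one until the size changes."""

    def __init__(self, factors):
        self.factors = list(factors)  # unroll factors generated by the compiler
        self.timings = {}             # factor -> first measured time
        self.data_size = None         # data set size the timings belong to

    def choose(self, data_size):
        """Return the unroll factor to use for the next execution."""
        if data_size != self.data_size:
            # Data set size changed: discard measurements and test again.
            self.data_size = data_size
            self.timings = {}
        for factor in self.factors:
            if factor not in self.timings:
                return factor         # still in the trial phase
        return min(self.timings, key=self.timings.get)

    def record(self, factor, elapsed):
        """Store the trial measurement; later runs do not overwrite it."""
        self.timings.setdefault(factor, elapsed)
```

A caller would invoke `choose` before each execution of the loop and `record` afterwards. Every new data set size costs one trial run per unroll factor, which is the overhead the cost/benefit remark above refers to.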

The invention may be implemented as an article of manufacture comprising a computer usable medium having computer readable program code means therein for executing the method steps of the invention. Such an article of manufacture may include, but is not limited to, CD-ROMs, diskettes, tapes, hard drives, and computer RAM or ROM. Also, the invention may be implemented in a computer system. A computer system may comprise a computer that includes a processor and a memory device and optionally, a storage device, a video display and/or an input device. Moreover, a computer system may comprise an interconnected network of computers. Computers may equally be in stand-alone form (such as the traditional desktop personal computer) or integrated into another apparatus (such as a cellular telephone).
While this invention has been described in relation to preferred embodiments, it will be understood by those skilled in the art that changes in the details of processes and structures may be made without departing from the spirit and scope of this invention. Many modifications and variations are possible in light of the above teaching. Thus, it should be understood that the above described embodiments have been provided by way of example rather than as a limitation and that the specification and drawing are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

The embodiments of the invention in which an exclusive property or privilege is claimed are defined as follows:

1. A method for optimizing a computer program, comprising the steps of:
executing an application program;
profiling a loop of the executing program by periodically sampling the loop at a predetermined frequency and measuring the execution time of the loop;
comparing the execution time of the loop to a threshold value for the loop; and
optimizing the subsequent execution of the loop by changing the loop to one of a serially and a parallel executing loop depending on said comparing step.

2. The method of claim 1 wherein said optimization comprises a serialization of the loop.

3. The method of claim 1 wherein said optimization comprises a serialization of the loop from a parallel loop.

4. The method of claim 3 wherein the loop includes a parallel and a serial version, and said serialization is a selection of the serial version for execution, and said parallelization is the selection of the parallel version for execution.

5. The method of claim 1 wherein the threshold value includes a sequential threshold value and a parallel threshold value, said sequential threshold value for determining an execution time above which the serially executing loop will be parallelized, and said parallel threshold value for determining an execution time below which the parallel executing loop will be serialized.

6. The method of claim 1 wherein said program executes on a computer system having a plurality of processors and said optimizing of the subsequent execution of a loop comprises selecting the number of processors for execution

11. The computer system of claim 9 wherein said optimization comprises a parallelization of the loop.

12. The computer system of claim 11 wherein the loop includes the parallel and the serial version, and said serialization is a selection of the serial version for execution and said parallelization is the selection of the parallel version for execution.

13. The computer system of claim 9 wherein the threshold value includes a sequential threshold value and a parallel threshold value, said sequential threshold value for determining an execution time above which the serially executing loop will be parallelized, and said parallel threshold value for determining an execution time below which the parallel executing loop will be serialized.

14. The computer system of claim 9 wherein said computer system includes a plurality of processors, and said optimizing of the subsequent execution of a loop comprises selecting the number of processors for execution of the loop.

15. The computer system of claim 14 wherein said threshold value is a preferred execution time for the number of processors selected.

16. The computer system of claim 9 further comprising means for compiling the program with a plurality of unroll factors prior to execution; and wherein said optimization comprises selection of one of said unroll factors for the loop.

17. An article of manufacture comprising a computer usable medium having computer readable program code embodied therein for optimizing program execution in a computer system, the computer readable program code in said article of manufacture comprising:
computer readable program code configured to cause a computer system to execute an application program;
computer readable program code configured to cause a computer system to periodically profile a loop of the executing program by sampling the loop at a predetermined frequency and measuring the execution time of the loop;
computer readable program code configured to cause a computer system to compare the execution time of the loop to a threshold value for the loop; and
computer readable program code configured to cause a computer system to optimize subsequent execution of the loop by changing the loop to one of a serially and a parallel executing loop depending on a result of said comparing code.
of the loop. . 45 18. The article of manufacture of claim 17 wherein said 

7. The method of claim 6 wherein said threshold value is optimization comprises a serialization of the loop. 

a preferred execution time for the number of processors 19 The Ktic|c of manufacture of c i aim 17 wherein ^ 

selected. ... - optimization comprises a parallelization of the loop. 

8. The method of claim 1 further comprising the step of 20 The of manufacture 0 f claim 19 wherein the 
compiling the program with a plurality of unroll factors prior $Q { a ^ ^ a ^ verskm> and ^ 
to execution; and wherein said optimization comprises serialization is a selection of the serial version for execution, 
selection of one of said unroll factors for the loop. and said parallelization is th e selection of the parallel 

9. A computer system for optimizing a program execution, yersion for execution . 

comprising: 21. The article of manufacture of claim 17 wherein the 

means for executing an application program; 5S threshold value includes a sequential threshold value and a 

means for periodically profiling a loop of the executing parallel threshold value, said sequential threshold value for 

program by sampling the loop at a predetermined determining an execution time above which the serially 

frequency and measuring the execution time of the executing loop will be parallelized, and said parallel thresh- 

l°op; old value for determining an execution time below which the 

means for comparing the execution time of the loop to a 60 parallel executing loop will be serialized. 

threshold value for the loop; and 22. The article of manufacture of claim 17 wherein said 
means for optimizing a subsequent execution of the loop computer system includes a plurality of processors, and said 
by changing the loop to one of a serially and a parallel optimization comprises the selection of a number of pro- 
executing loop depending on a result of said comparing cessors for execution of the loop. 

means. 65 23. The article of manufacture of claim 22 wherein said 

10. The computer system of claim 9 wherein said opti- threshold value is a preferred execution time for the number 
mization comprises a serialization of the loop. of processors selected. 
24. The article of manufacture of claim 17 further comprising
computer readable program code configured to cause a computer system
to compile the program with a plurality of unroll factors prior to
execution; and wherein said optimization comprises selection of one
of said unroll factors for the loop.

25. An article of manufacture comprising a computer 
usable medium having computer readable program code 
embodied therein for optimizing program execution in a 
computer system, the computer readable program code in 
said article of manufacture comprising: 

(a) computer readable program code configured to cause 
a computer system to execute an application program; 
(b) computer readable program code configured to cause 
a computer system to monitor a parameter of a loop of 
said executing program; 

(c) computer readable program code configured to cause 
a computer system to compare the monitored parameter 
to a threshold value for the loop; and 

(d) computer readable program code configured to cause 
a computer system to apply an optimization on subse- 
quent execution of the loop by changing the loop to one 
of a serially and parallel executing loop depending on 
a result of said comparison code. 
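
As an illustration only, the profile, compare, and select steps recited in the claims above, together with the two-threshold variant of claims 5, 13 and 21, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the patented implementation: the names `LoopThresholds`, `choose_mode`, and `AdaptiveLoop` are hypothetical, "sampling at a predetermined frequency" is modeled as timing every N-th invocation of the loop, and the threshold values shown are arbitrary.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable

SERIAL = "serial"
PARALLEL = "parallel"


@dataclass
class LoopThresholds:
    """Hypothetical per-loop thresholds, in seconds."""
    sequential: float  # a serial run slower than this is parallelized
    parallel: float    # a parallel run faster than this is serialized


def choose_mode(current: str, exec_time: float, t: LoopThresholds) -> str:
    """Compare a measured execution time to the threshold for the current mode."""
    if current == SERIAL and exec_time > t.sequential:
        return PARALLEL
    if current == PARALLEL and exec_time < t.parallel:
        return SERIAL
    return current


class AdaptiveLoop:
    """Holds a serial and a parallel version of one loop and picks between them."""

    def __init__(self,
                 serial_version: Callable[[Any], Any],
                 parallel_version: Callable[[Any], Any],
                 thresholds: LoopThresholds,
                 sample_every: int = 1,
                 clock: Callable[[], float] = time.perf_counter) -> None:
        self.serial_version = serial_version
        self.parallel_version = parallel_version
        self.thresholds = thresholds
        self.sample_every = sample_every  # assumption: sample every N-th call
        self.clock = clock
        self.mode = SERIAL
        self.calls = 0

    def run(self, data: Any) -> Any:
        version = (self.serial_version if self.mode == SERIAL
                   else self.parallel_version)
        self.calls += 1
        if self.calls % self.sample_every != 0:
            return version(data)  # unsampled call: run without profiling
        start = self.clock()
        result = version(data)
        elapsed = self.clock() - start
        # The decision affects only subsequent executions of the loop.
        self.mode = choose_mode(self.mode, elapsed, self.thresholds)
        return result
```

Keeping `choose_mode` as a pure function separates the comparison from the timing, and using two distinct thresholds leaves a band of execution times in which the loop stays in its current version rather than switching back and forth between runs.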